Normalized Mutual Information to evaluate overlapping community finding algorithms

Authors

  • Aaron F. McDaid
  • Derek Greene
  • Neil J. Hurley
Abstract

Given the increasing popularity of algorithms for overlapping clustering, in particular in social network analysis, quantitative measures are needed to assess the accuracy of a method. Given a set of true clusters and the set of clusters found by an algorithm, these two sets of clusters must be compared to see how similar or different they are. A normalized measure is desirable in many contexts: for example, one that assigns a value of 0 where the two sets are totally dissimilar and 1 where they are identical. A measure based on normalized mutual information [1] has recently become popular. We demonstrate unintuitive behaviour of this measure, and show how this can be corrected by using a more conventional normalization. We compare the results to those of other measures, such as the Omega index [2]. A C++ implementation is available online at https://github.com/aaronmcdaid/Overlapping-NMI.

In a non-overlapping scenario, each node belongs to exactly one cluster. We are looking at overlapping clustering, where a node may belong to many clusters, or indeed to no cluster at all. Such a set of clusters has been referred to as a cover in the literature, and this is the terminology we will use. For a good introduction to the problem of comparing covers of overlapping clusters, see [2]. They describe the Rand index, which is defined only for disjoint (non-overlapping) clusters, and then show how to extend it to overlapping clusters. Each pair of nodes is considered, and the number of clusters the pair has in common is counted. Even if a typical node is in many clusters, it is likely that a randomly chosen pair of nodes will have zero clusters in common. These counts are calculated for both covers, and the Omega index is defined as the proportion of pairs for which the shared-cluster count is identical, subject to a correction for chance.
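To make the pair-counting computation concrete, here is a minimal C++ sketch of the Omega index as just described. It is our own illustration, not the paper's implementation: the representation (each cluster as a set of node ids), the function names sharedCounts and omegaIndex, and the standard chance-corrected form (observed − expected) / (1 − expected) are assumptions on our part.

#include <iterator>
#include <map>
#include <set>
#include <utility>
#include <vector>

using Cover = std::vector<std::set<int>>;  // each cluster: a set of node ids

// For every pair of nodes (u < v), count how many clusters of the cover
// contain both u and v; pairs sharing zero clusters stay absent from the map.
static std::map<std::pair<int, int>, int> sharedCounts(const Cover& cover) {
    std::map<std::pair<int, int>, int> counts;
    for (const auto& cluster : cover)
        for (auto u = cluster.begin(); u != cluster.end(); ++u)
            for (auto v = std::next(u); v != cluster.end(); ++v)
                ++counts[{*u, *v}];
    return counts;
}

// Omega index of covers x and y over nodes 0..n-1:
// (observed - expected) / (1 - expected), where "observed" is the fraction
// of node pairs whose shared-cluster count is identical in both covers and
// "expected" is the agreement expected by chance from the count histograms.
double omegaIndex(const Cover& x, const Cover& y, int n) {
    const double totalPairs = double(n) * (n - 1) / 2.0;
    const auto cx = sharedCounts(x), cy = sharedCounts(y);

    // Observed agreement: pairs with the same nonzero count in both covers,
    // plus the pairs that share zero clusters in both covers.
    double agree = 0, inBoth = 0;
    for (const auto& [pr, k] : cx) {
        auto it = cy.find(pr);
        if (it != cy.end()) {
            ++inBoth;
            if (it->second == k) ++agree;
        }
    }
    const double zeroInBoth = totalPairs - (cx.size() + cy.size() - inBoth);
    const double observed = (agree + zeroInBoth) / totalPairs;

    // Histograms over k: how many pairs share exactly k clusters in each cover.
    std::map<int, double> hx, hy;
    hx[0] = totalPairs - cx.size();
    hy[0] = totalPairs - cy.size();
    for (const auto& [pr, k] : cx) ++hx[k];
    for (const auto& [pr, k] : cy) ++hy[k];
    double expected = 0;
    for (const auto& [k, cnt] : hx)
        if (hy.count(k)) expected += cnt * hy[k];
    expected /= totalPairs * totalPairs;

    // Degenerate (0/0) when expected == 1, i.e. every pair agrees by chance.
    return (observed - expected) / (1.0 - expected);
}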
I. MUTUAL INFORMATION

Meila [3] defined a measure based on mutual information for comparing disjoint clusterings. Lancichinetti et al. [1] proposed a measure, also based on mutual information, extended to covers. This measure has become quite popular for comparing community finding algorithms in social network analysis. It is this measure we are primarily concerned with here, and we will refer to it as NMI_LFK after the authors' initials. We propose to use a different normalization to that used in NMI_LFK, but first we define the non-normalized measure, which is based very closely on that in NMI_LFK. You may want to compare this to the final section of Lancichinetti et al. [1].

Given two covers, X and Y, we must first see how to measure the similarity between a pair of clusters. X and Y are binary matrices of cluster membership. There are n objects. The first cover has $K_X$ clusters, and hence X is a $K_X \times n$ matrix; similarly, Y is a $K_Y \times n$ matrix. $X_{im}$ tells us whether node m is in cluster i of cover X. To compare cluster i of the first cover to cluster j of the second cover, we compare the row vectors $X_i$ and $Y_j$: vectors of ones and zeroes denoting which nodes are in the cluster. For each such pair we count the four possible membership combinations:

  • $a = \sum_{m=1}^{n} [X_{im} = 0 \wedge Y_{jm} = 0]$
  • $b = \sum_{m=1}^{n} [X_{im} = 0 \wedge Y_{jm} = 1]$
  • $c = \sum_{m=1}^{n} [X_{im} = 1 \wedge Y_{jm} = 0]$
  • $d = \sum_{m=1}^{n} [X_{im} = 1 \wedge Y_{jm} = 1]$

If a + d = n, and therefore b = c = 0, then the two vectors are in complete agreement. The lack of information between two vectors is defined as

  $H(X_i|Y_j) = H(X_i, Y_j) - H(Y_j) = h(a,n) + h(b,n) + h(c,n) + h(d,n) - h(b+d,n) - h(a+c,n)$   (1)

where $h(w,n) = -\frac{w}{n}\log_2\frac{w}{n}$.

There is an interesting technicality here. Imagine a pair of clusters whose memberships have been assigned randomly. There is a possibility of a small amount of mutual information even when the two vectors are negatively correlated with each other; in extremis, if the two vectors are near complements of each other, the mutual information will be very high. We wish to override this and define the mutual information to be zero in such cases. This restriction is given in equation (B.14) of [1], and we also use it in our proposal:

  $H(X_i|Y_j) = \begin{cases} H(X_i, Y_j) - H(Y_j) & \text{if } h(a,n) + h(d,n) \geq h(b,n) + h(c,n) \\ h(c+d,n) + h(a+b,n) & \text{otherwise} \end{cases}$   (2)

The second case equals the unconditional entropy $H(X_i)$, so the mutual information between the two vectors is zero there.

This allows us to compare vectors $X_i$ and $Y_j$, but we want to compare the entire matrices X and Y to each other. We follow the approximation used by [1] here and match each vector in X to its best match in Y,

  $H(X_i|Y) = \min_{j \in \{1, \ldots, K_Y\}} H(X_i|Y_j)$   (3)

then sum across all the vectors in X:

  $H(X|Y) = \sum_{i=1}^{K_X} H(X_i|Y)$
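As a companion to the equations, here is a minimal C++ sketch of this building block, equations (1) through (3). It assumes each cluster is stored as a 0/1 membership vector over the n nodes; the names h, condEntropy, and coverCondEntropy are ours for illustration and do not come from the linked repository.

#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

using Cluster = std::vector<int>;      // 0/1 membership flags over the n nodes
using Cover   = std::vector<Cluster>;  // one row per cluster, as in the matrices X, Y

// h(w, n) = -(w/n) log2(w/n), with the usual convention h(0, n) = 0.
static double h(double w, double n) {
    return w > 0 ? -(w / n) * std::log2(w / n) : 0.0;
}

// Equations (1) and (2): the lack of information H(Xi|Yj). When the two
// vectors look more like complements than like copies, fall back to
// h(c+d, n) + h(a+b, n) = H(Xi), so that the mutual information is zero.
static double condEntropy(const Cluster& xi, const Cluster& yj) {
    const double n = xi.size();
    double a = 0, b = 0, c = 0, d = 0;
    for (std::size_t m = 0; m < xi.size(); ++m) {
        if      (!xi[m] && !yj[m]) ++a;
        else if (!xi[m] &&  yj[m]) ++b;
        else if ( xi[m] && !yj[m]) ++c;
        else                       ++d;
    }
    if (h(a, n) + h(d, n) >= h(b, n) + h(c, n))        // test from eq. (2)
        return h(a, n) + h(b, n) + h(c, n) + h(d, n)   // eq. (1):
             - h(b + d, n) - h(a + c, n);              // H(Xi,Yj) - H(Yj)
    return h(c + d, n) + h(a + b, n);                  // = H(Xi)
}

// Equation (3) plus the final sum: H(X|Y) = sum_i min_j H(Xi|Yj).
double coverCondEntropy(const Cover& x, const Cover& y) {
    double total = 0;
    for (const auto& xi : x) {
        double best = std::numeric_limits<double>::infinity();
        for (const auto& yj : y)
            best = std::min(best, condEntropy(xi, yj));
        total += best;
    }
    return total;
}

Note that this is only the non-normalized building block defined above; the choice of normalization applied on top of it is where the proposal differs from NMI_LFK.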

Similar articles

A revisit to evaluating accuracy of community detection using the normalized mutual information

Normalized Mutual Information (NMI) has been widely used to evaluate the accuracy of community detection algorithms. In this note we show that NMI is seriously affected by systematic error due to the finite size of networks, and may give a wrong estimate of the performance of algorithms in some cases. A simple expression for the estimate of this error is derived and tested numerically. We suggest to use a n...


An Optimized Firefly Algorithm based on Cellular Learning Automata for Community Detection in Social Networks

Community structure is one of the important features of social networks. A community is a subgraph whose nodes have many connections to nodes inside the community and very few connections to nodes outside it. The objective of community detection is to separate groups or communities that are linked more closely. In fact, community detection is the clustering...


Hierarchical mutual information for the comparison of hierarchical community structures in complex networks

The quest for a quantitative characterization of community and modular structure in complex networks has produced a variety of methods and algorithms to classify different networks. However, it is not clear whether such methods provide consistent, robust, and meaningful results when considering hierarchies as a whole. Part of the problem is the lack of a similarity measure for the comparison of hierarch...


Research of Blind Signals Separation with Genetic Algorithm and Particle Swarm Optimization Based on Mutual Information

Blind source separation separates mixed signals without any information about the mixing system. In this paper, we use two evolutionary algorithms, namely genetic algorithm and particle swarm optimization, for blind source separation. In these techniques, a novel fitness function based on mutual information and high-order statistics is proposed. In order to evalu...




Journal:
  • CoRR

Volume: abs/1110.2515  Issue: —

Pages: —

Publication date: 2011